Abstract. We consider the source separation problem for single-channel 
music signals. After a brief review of existing methods, we focus on de- 
composing a mixture into components made of harmonic sinusoidal par- 
tials. We address this problem in the Bayesian framework by building a 
probabilistic model of the mixture combining generic priors for harmonic- 
ity, spectral envelope, note duration and continuity. Experiments suggest 
that the derived blind decomposition method leads to better separation 
results than nonnegative matrix factorization for certain mixtures. 


1 Introduction 


1.1 Constrained specific models and unconstrained generic models 


Single-channel musical source separation is the problem of extracting the source
signals $(s_j(t))_{1 \le j \le J}$ underlying a music signal $x(t) = \sum_{j=1}^{J} s_j(t)$. This problem
can be addressed by building appropriate models of the sources. The source 
models proposed in the literature rely on different amounts of prior information. 

Some methods exploit constrained source models representing the sources 
in a specific mixture with a good accuracy. For example, methods based on 
sparse coding with a fixed dictionary [1] or on factorial hidden Markov mod- 
els [2] typically assume that the source models can be learnt on segments of 
the mixture where only one source is present. These methods provide very good 
separation results, given the difficulty of the problem, but until now they rely 
on knowing the instruments present in the mixture and performing a manual 
segmentation. Other methods based on Computational Auditory Scene Analy- 
sis (CASA) with instrument templates [3] or on hybrid source models [4] rely 
on instrument-specific timbre properties learnt on a database of isolated notes. 
These methods also perform satisfactorily, but they cannot be applied when some
of the instruments present in the mixture are not part of the learning database. 

By contrast, other methods rely on unconstrained generic source models ap- 
plicable to a large range of mixtures. For example, Nonnegative Matrix Factor- 
ization (NMF) decomposes the mixture short-term magnitude spectrum into a 
sum of components modeled by a fixed magnitude spectrum and a time-varying 
gain, assuming no constraints about the spectra and the gains except positivity [5].
Source separation can then be achieved by clustering the components into sources,
provided each component belongs to a single source. Good results
based on automatic clustering have been reported for the separation of vocals [6] 
or drums [7] from real mixtures. Other studies using a manual clustering have 
shown that NMF can be used to separate real mixtures of non-percussive instruments [8].
However, the NMF source model is not adapted to certain types of
mixtures, such as those involving notes with time-varying fundamental frequency, 
instruments with similar spectral envelope or instruments playing synchronously. 


1.2 Harmonicity as a precise generic model 


In this paper, we assume that each musical note is a near-periodic signal con- 
taining harmonic sinusoidal partials. Harmonicity means that at each instant 
the frequencies of the partials are multiples of a single fundamental frequency. 
This assumption is true for sustained instruments such as bowed strings and 
winds and approximately true for many other instruments. It is false for drums, 
human voice and other noisy or transient sounds. Harmonicity can thus be seen 
as a precise generic model: it gives more information about the sources than the 
NMF model while being valid for a large range of mixtures. In the following, we 
call harmonic component a set of harmonic partials having common onset and 
offset times and we address the problem of Harmonic Component Extraction 
(HCE), that is the decomposition of a mixture into such components. We do not 
discuss the difficult issue of clustering the estimated components into sources. 

Most existing HCE methods consist in performing a polyphonic pitch track- 
ing, that is transcribing the fundamental frequencies of the notes present in the 
mixture, and then estimating the amplitudes and phases of their harmonics. 
Methods exploiting harmonicity only [9] are insufficient for source separation. 
Indeed harmonicity does not provide enough information to segregate partials 
from different sources overlapping at the same frequency. Other methods have 
used complementary assumptions of spectral continuity [10,11] and temporal 
continuity [12,10] to this aim. Since polyphonic pitch tracking is a difficult prob- 
lem for which no current algorithm provides a perfect solution, the separation 
performance of these methods was mostly evaluated based on prior knowledge 
of the fundamental frequencies and few quantitative results were reported. 

In the following, we recast the problem of estimating harmonic components 
in the Bayesian framework. We model the mixture signal as a sum of harmonic 
components whose parameters are governed by probabilistic priors and we es- 
timate the number of components and their parameters using a Maximum A 
Posteriori (MAP) criterion. This can be seen as a coherent approach where 
polyphonic pitch tracking and estimation of the amplitudes and phases of the 
partials are performed using the same model. The proposed model is inspired by 
Bayesian harmonic models introduced previously in the literature for polyphonic 
pitch transcription [13] but it includes several modifications. Most importantly, 
we design a perceptually motivated residual prior and we learn the parameters 
of other priors on a database of isolated notes rather than setting them manually 
to arbitrary values. When this learning database is large, the resulting model is 
generic. We have also used this model recently for object coding purposes [14]. 


The rest of the paper is structured as follows. Section 2 presents the gen- 
erative model of the mixture and the associated inference algorithm. Section 
3 compares the performance of the proposed method with NMF on a few test 
mixtures. Finally Section 4 discusses some future research directions. 


2 Bayesian inference of harmonic components 


2.1 Signal model 


The proposed model is expressed in the time domain. Let $x_n(t)$ be the n-th
frame of the mixture signal $x(t)$, defined by $x_n(t) = w(t)\,x(nS + t)$, where $w(t)$ is
a Hanning window of length $W$ and $S$ is the step size. We expand $x_n(t)$ as

$$x_n(t) = \sum_{c \in C_n} s_{cn}(t) + e_n(t), \tag{1}$$

where $(s_{cn}(t))_{c \in C_n}$ are the harmonic components present in this frame and $e_n(t)$
is the residual. We define each harmonic component, which generally spans several
time frames, by

$$s_{cn}(t) = w(t) \sum_{m=1}^{M_c} a_{cmn} \cos(2\pi m f_{cn} t + \phi_{cmn}), \tag{2}$$

where $f_{cn}$ is its fundamental frequency and $(a_{cmn}, \phi_{cmn})$ are the time-varying
amplitude and phase of its m-th partial in the n-th frame.
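The signal model of Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes that frequencies are given in Hz and time in seconds, and it uses NumPy's symmetric `np.hanning` window; the function names and the example notes are hypothetical and only serve the illustration.

```python
import numpy as np

def harmonic_component_frame(f0, amps, phases, W, fs):
    """One windowed frame of a harmonic component, as in Eq. (2).

    f0     : fundamental frequency f_cn in Hz (assumed unit)
    amps   : amplitudes a_cmn of the M_c partials
    phases : phases phi_cmn of the M_c partials, in radians
    W, fs  : frame length in samples and sampling rate in Hz
    """
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    t = np.arange(W) / fs                      # time axis in seconds
    m = np.arange(1, len(amps) + 1)[:, None]   # partial indices 1..M_c
    partials = amps[:, None] * np.cos(2 * np.pi * m * f0 * t + phases[:, None])
    return np.hanning(W) * partials.sum(axis=0)

def residual_frame(x_n, components):
    """Residual e_n(t) of Eq. (1): mixture frame minus the sum of its components."""
    return x_n - np.sum(components, axis=0)

# Illustrative values only: two notes in one 1024-sample frame at 22.05 kHz.
fs, W = 22050, 1024
s1 = harmonic_component_frame(220.0, [1.0, 0.5, 0.25], [0.0, 0.3, 1.0], W, fs)
s2 = harmonic_component_frame(330.0, [0.8, 0.4], [0.5, 0.0], W, fs)
x_n = s1 + s2                                  # noiseless two-note mixture frame
e_n = residual_frame(x_n, [s1, s2])            # zero here, by construction
```

In a real mixture the residual is not zero: it collects noise, transients and any modeling error of the harmonic components.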


2.2 Frequency, amplitude and spectral envelope priors 


We associate each component with a latent fundamental frequency $F_c$ belonging
to the MIDI scale, which is the discrete 1/12 octave scale used for western musical
scores. We constrain the number of partials $M_c$ of the c-th component to

$$M_c = \min(\lfloor F_{\max}/F_c \rfloor, M_{\max}), \tag{3}$$

where $F_{\max}$ is the Nyquist frequency and $M_{\max}$ is set to 60. On each time frame,
we model the fundamental frequency by a log-Gaussian prior

$$P(\log f_{cn}) = \mathcal{N}(\log f_{cn}; \log F_c, \sigma^f), \tag{4}$$

where $\mathcal{N}(\cdot; \mu, \sigma)$ is the univariate Gaussian density of mean $\mu$ and standard
deviation $\sigma$. In order to help estimate the amplitudes of the partials when partials
from several notes overlap at the same frequency, we describe the amplitudes as
the product of a fixed normalized spectral envelope $(\mu^a_{F_c m})_{1 \le m \le M_c}$, a latent
log-Gaussian amplitude factor $r_{cn}$ and a log-Gaussian residual, that is

$$P(\log a_{cmn} \mid r_{cn}) = \mathcal{N}(\log a_{cmn}; \log(r_{cn}\, \mu^a_{F_c m}), \sigma^a_{F_c m}), \tag{5}$$

$$P(\log r_{cn}) = \mathcal{N}(\log r_{cn}; \mu^r_{F_c}, \sigma^r_{F_c}). \tag{6}$$

Finally we assume that the phases of the partials are uniformly distributed

$$P(\phi_{cmn}) = 1/(2\pi). \tag{7}$$
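As a rough illustration of how the local priors of this section combine, the following sketch evaluates their joint log-density for one component in one frame. It assumes that the number of partials in Eq. (3) is rounded down to an integer, and every hyper-parameter value in the example is made up for the illustration rather than learnt from data as in the paper.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log of the univariate Gaussian density N(x; mean, std)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def num_partials(F_c, F_max, M_max=60):
    """Eq. (3), assuming the ratio F_max / F_c is rounded down."""
    return min(int(F_max // F_c), M_max)

def local_log_prior(f, a, r, F_c, mu_a, sigma_a, mu_r, sigma_r, sigma_f):
    """Sum of the log-priors of Eqs. (4)-(7) for one component in one frame.

    f, r   : frame-wise fundamental frequency and amplitude factor
    a      : amplitudes of the partials
    mu_a   : normalized spectral envelope, sigma_a its log-domain deviations
    """
    lp = gaussian_logpdf(math.log(f), math.log(F_c), sigma_f)          # Eq. (4)
    lp += gaussian_logpdf(math.log(r), mu_r, sigma_r)                  # Eq. (6)
    for a_m, mu_m, s_m in zip(a, mu_a, sigma_a):
        lp += gaussian_logpdf(math.log(a_m), math.log(r * mu_m), s_m)  # Eq. (5)
    lp -= len(a) * math.log(2 * math.pi)                               # Eq. (7)
    return lp

# Hypothetical hyper-parameters, for illustration only.
mu_a, sigma_a = [1.0, 0.5, 0.25], [0.3, 0.3, 0.3]
args = dict(F_c=440.0, mu_a=mu_a, sigma_a=sigma_a, mu_r=0.0, sigma_r=1.0, sigma_f=0.02)
lp_match = local_log_prior(440.0, [2.0, 1.0, 0.5], 2.0, **args)    # follows the envelope
lp_detuned = local_log_prior(452.0, [2.0, 1.0, 0.5], 2.0, **args)  # detuned fundamental
lp_odd_env = local_log_prior(440.0, [2.0, 0.2, 0.5], 2.0, **args)  # weak second partial
```

The envelope prior is what lets the model split the energy of overlapping partials: an amplitude that departs strongly from the expected envelope is penalized, as `lp_odd_env` shows.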


2.3 Duration and continuity priors 


Perceptually annoying discontinuities may appear in the extracted source signals 
when the model parameters are estimated on each time frame separately. Thus 
we add duration and continuity priors on the parameters. We associate each 
point on the MIDI scale with a binary activity state in each frame determining 
whether a harmonic component with the corresponding latent frequency $F_c$ is
being played or not in that frame, with the constraint that different instruments 
cannot play notes with the same latent frequency at the same time. We assume 
that the sequences of activity states for different points on the MIDI scale are 
independent, and we model each sequence by a two-state Markov prior. We also 
set temporal continuity priors on the frequencies and amplitudes of the partials 


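$$P(\log f_{cn} \mid f_{c,n-1}) = \mathcal{N}(\log f_{cn}; \log f_{c,n-1}, \sigma^{f'}), \tag{8}$$
$$P(\log a_{cmn} \mid a_{cm,n-1}) = \mathcal{N}(\log a_{cmn}; \log a_{cm,n-1}, \sigma^{a'}_{F_c m}), \tag{9}$$
$$P(\log r_{cn} \mid r_{c,n-1}) = \mathcal{N}(\log r_{cn}; \log r_{c,n-1}, \sigma^{r'}_{F_c}). \tag{10}$$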


The global prior on amplitudes and frequencies is then defined up to a multi- 
plicative constant by multiplying these priors with the local priors defined above. 
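The effect of a two-state Markov prior on one sequence of activity states can be sketched with a standard Viterbi decoder. This is only an illustration of the kind of smoothing such a duration prior provides: the per-frame log-likelihoods and transition probabilities below are invented, and the paper does not state that its inference uses exactly this procedure.

```python
import numpy as np

def viterbi_two_state(loglik, log_trans, log_init):
    """MAP state sequence of a two-state Markov chain (0 = inactive, 1 = active).

    loglik    : (N, 2) per-frame log-likelihood of each state
    log_trans : (2, 2) log P(state_n = j | state_{n-1} = i) at entry [i, j]
    log_init  : (2,) log-probabilities of the initial state
    """
    loglik = np.asarray(loglik, dtype=float)
    N = loglik.shape[0]
    delta = np.asarray(log_init, dtype=float) + loglik[0]
    back = np.zeros((N, 2), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + log_trans      # scores[i, j]: come from i, go to j
        back[n] = np.argmax(scores, axis=0)      # best predecessor of each state
        delta = np.max(scores, axis=0) + loglik[n]
    states = np.zeros(N, dtype=int)
    states[-1] = int(np.argmax(delta))
    for n in range(N - 1, 0, -1):                # backtrack from the last frame
        states[n - 1] = back[n, states[n]]
    return states

# Illustrative frame-wise evidence with a one-frame dropout inside a note.
evidence = np.array([1, 1, 1, 0, 1, 1, 1])
loglik = np.log(np.where(evidence[:, None] == np.arange(2), 0.8, 0.2))
log_init = np.log([0.5, 0.5])
sticky = np.log([[0.95, 0.05], [0.05, 0.95]])    # states tend to persist
uniform = np.log(np.full((2, 2), 0.5))           # no duration information

smoothed = viterbi_two_state(loglik, sticky, log_init)
framewise = viterbi_two_state(loglik, uniform, log_init)
```

With the persistent transition matrix the dropout is filled in, whereas uniform transitions reproduce the frame-wise decisions.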


2.4 Perceptually motivated residual prior 


The role of the prior on the residual is to ensure that the largest possible number 
of notes present in the mixture are extracted using a given number of compo- 
nents. The standard Gaussian prior measures the distortion between the mixture 
signal and the model according to the energy of the residual. This often results in 
several components being used to represent high-energy notes, while low energy 
parts of the mixture such as low energy notes, onsets and reverberation are not 
transcribed despite their perceptual significance. We design instead a weighted 
Gaussian prior inspired from the distortion measures proposed in [15,16] which 
give a larger weight to perceptually significant low energy parts. 

The proposed prior models the first stages of auditory processing. The in- 
coming sound first passes through the outer and the middle ear and is split by 
the cochlea into several frequency subbands called auditory bands. The energy 
in each auditory band is then transformed nonlinearly into a loudness value 
taking into account masking phenomena. More precisely, we measure the power
of the residual in the b-th auditory band by $\tilde{E}_{nb} = \sum_{f=0}^{W/2} v_{bf}\, g_f |E_{nf}|^2$, where
$(E_{nf})_{0 \le f \le W-1}$ are the Fourier transform coefficients of $e_n(t)$, $(v_{bf})_{0 \le f \le W/2}$ are
coefficients modeling the frequency spread of that band and $(g_f)_{0 \le f \le W/2}$ is the
frequency response of the outer and middle ear. We measure similarly the power
of the mixture signal in that band by $\tilde{X}_{nb} = \sum_{f=0}^{W/2} v_{bf}\, g_f |X_{nf}|^2$. Then we define
the distortion due to the residual on the n-th frame by
$L_n = \sum_{b=1}^{B} \tilde{E}_{nb} \tilde{X}_{nb}^{-0.75}$, where $B$ is the number of auditory bands.
It can be shown that this distortion is approximately equal to the perceived
loudness of the residual on that frame [16]. We derive the residual prior from the
distortion by $P(e_n) \propto \exp(-L_n/(2(\sigma^e)^2))$. This prior can also be expressed as

$$P(E_{nf}) = \mathcal{N}(E_{nf}; 0, \sigma^e \gamma_{nf}^{-1/2}), \tag{11}$$

where

$$\gamma_{nf} = g_f \sum_{b=1}^{B} v_{bf} \Big[ \sum_{f'=0}^{W/2} v_{bf'}\, g_{f'} |X_{nf'}|^2 \Big]^{-0.75}. \tag{12}$$

2.5 Approximate inference of harmonic components 


The signal model and the parameter priors define together a probabilistic gen- 
erative model of the mixture signal that is used to infer the MAP values of 
the activity states and the frequency, amplitude and phase parameters repre- 
senting a given mixture. Due to the complexity of the model, exact inference 
is intractable. We therefore use a three-step approximate inference procedure 
instead. First we estimate the MAP activity states and the corresponding MAP 
parameters on each time frame separately, then we refine the estimation of the 
states by adding the duration priors, and finally we refine the estimation of the 
parameters by keeping the states fixed and adding the continuity priors. More 
details about these steps are given in [14]. Each harmonic component is then 
directly synthesized from the corresponding parameters. 


3 Evaluation 


3.1 Training, performance measure and optimal clustering 


We evaluate the proposed HCE method on test mixtures sampled at 22.05 kHz.
Hyper-parameters of the generative model are set to the same values for all test
mixtures: $\sigma^f$, $(\mu^a_{Fm})$, $(\sigma^a_{Fm})$, $(\mu^r_F)$, $(\sigma^r_F)$, $\sigma^{f'}$, $(\sigma^{a'}_{Fm})$ and $(\sigma^{r'}_F)$ are learnt on
part of the RWC¹ Musical Instrument Database whereas $\sigma^e$ and the Markov
transition probabilities are set manually. The frame parameters are set to $W = 1024$
(46 ms) and $S = 512$ (23 ms) and discrete fundamental frequencies span
the range between MIDI 36 (65 Hz) and MIDI 100 (2640 Hz).
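The durations and frequencies quoted above follow from simple arithmetic, sketched here under the usual equal-temperament convention that MIDI note 69 corresponds to 440 Hz (a convention the paper does not state explicitly).

```python
fs = 22050.0                     # sampling rate in Hz
W, S = 1024, 512                 # frame length and step size in samples

frame_ms = 1000.0 * W / fs       # about 46.4 ms
step_ms = 1000.0 * S / fs        # about 23.2 ms

def midi_to_hz(p):
    """Equal-temperament frequency of MIDI note p, assuming A4 = MIDI 69 = 440 Hz."""
    return 440.0 * 2.0 ** ((p - 69) / 12.0)

f_low = midi_to_hz(36)           # about 65.4 Hz
f_high = midi_to_hz(100)         # about 2637 Hz, quoted as 2640 Hz in the text
n_pitches = 100 - 36 + 1         # 65 candidate latent frequencies
```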

For comparison purposes, we also evaluate NMF on the same test mixtures.
We write the NMF generative model as $|X_{nf}| = \sum_{c=1}^{C} \Phi_{cf}\, a_{cn} + E_{nf}$, where
$(\Phi_{cf})_{0 \le f \le W/2}$ and $(a_{cn})_{0 \le n \le N-1}$ are the fixed spectrum and time-varying
amplitude of the c-th nonnegative component respectively. We assume that these
quantities are positive and that the residual $E_{nf}$ follows the weighted Gaussian
prior above. The total number of spectra $C$ is fixed manually and the spectra
and time-varying amplitudes are estimated using multiplicative update rules.
Source signals including several spectra are then synthesized by inverse Fourier
transform and overlap-add using the phase spectrum of the mixture signal. This
algorithm is similar to the weighted NMF algorithm introduced in [16], except
that the definition of the time-frequency weights $(\gamma_{nf})$ is modified by taking into
account overlap between auditory bands.
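A weighted NMF with multiplicative updates can be sketched as follows. The update rules shown are the standard ones for a weighted squared-error cost, which is an assumption on our part since the exact rules are not given here; the array layout (frequency by time) and all names are likewise chosen for the illustration.

```python
import numpy as np

def weighted_nmf(X, G, C, n_iter=200, seed=0, eps=1e-12):
    """Minimise sum over (f, n) of G * (X - Phi @ A)**2 with Phi, A nonnegative.

    X : (F, N) magnitude spectrogram |X_nf|
    G : (F, N) nonnegative time-frequency weights
    C : number of components
    Returns the spectra Phi (F, C), the gains A (C, N) and the cost per iteration.
    """
    rng = np.random.default_rng(seed)
    F, N = X.shape
    Phi = rng.random((F, C)) + eps
    A = rng.random((C, N)) + eps
    costs = []
    for _ in range(n_iter):
        V = Phi @ A
        Phi *= ((G * X) @ A.T) / ((G * V) @ A.T + eps)    # update the spectra
        V = Phi @ A
        A *= (Phi.T @ (G * X)) / (Phi.T @ (G * V) + eps)  # update the gains
        costs.append(float(np.sum(G * (X - Phi @ A) ** 2)))
    return Phi, A, costs

# Toy data: an exactly rank-2 nonnegative "spectrogram" with random weights.
rng = np.random.default_rng(1)
X = rng.random((20, 2)) @ rng.random((2, 30))
G = rng.random((20, 30)) + 0.1
Phi, A, costs = weighted_nmf(X, G, C=2, n_iter=100)
```

Because each update only multiplies by a nonnegative ratio, the factors stay nonnegative without any explicit projection, and the weighted cost does not increase from one iteration to the next.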

¹ http://staff.aist.go.jp/m.goto/RWC-MDB/

For evaluation purposes, we partition components produced by HCE or NMF
into source clusters based on prior knowledge of the true sources. We define the
optimal clusters as those which maximize the overall source separation performance
and we compute them using a beam search procedure. This “oracle” clustering
is not feasible in realistic situations; however, it allows the measurement
of the best source separation quality potentially achievable.
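An oracle clustering by beam search might look like the sketch below. It is a simplified stand-in: partial assignments are ranked by squared error against the true sources rather than by the SDR-based measure used in the paper, components are processed in a fixed order, and the function name and beam width are hypothetical.

```python
import numpy as np

def oracle_clustering(components, sources, beam_width=8):
    """Assign each component to one source by beam search.

    components : (C, T) array, one row per extracted component
    sources    : (J, T) array, one row per true source
    Returns a list of C source indices.
    """
    components = np.asarray(components, dtype=float)
    sources = np.asarray(sources, dtype=float)
    J = sources.shape[0]
    beam = [((), np.zeros_like(sources))]        # (labels so far, current estimates)
    for comp in components:
        candidates = []
        for labels, est in beam:
            for j in range(J):                   # try adding comp to each source
                new_est = est.copy()
                new_est[j] += comp
                candidates.append((labels + (j,), new_est))
        # Keep the partial assignments whose estimates are closest to the truth.
        candidates.sort(key=lambda item: float(np.sum((item[1] - sources) ** 2)))
        beam = candidates[:beam_width]
    return list(beam[0][0])

# Toy example: four components that belong alternately to two sources.
rng = np.random.default_rng(0)
comps = rng.random((4, 50))
true_labels = [0, 1, 0, 1]
srcs = np.stack([comps[[0, 2]].sum(axis=0), comps[[1, 3]].sum(axis=0)])
found = oracle_clustering(comps, srcs)
```

A wider beam trades computation for a lower risk of discarding the assignment that would turn out best once all components are placed.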

The source separation performance is measured locally for each estimated 
source j around each time frame n using a local phase-blind Signal-to-Distortion 
Ratio (SDR) in decibels (dB) defined by 


$$\mathrm{SDR}_{jn} = 10 \log_{10} \frac{\sum_{l=0}^{W'-1} \sum_{f=0}^{W/2} w'(l)\, |S_{j,n+l,f}|^2}{\sum_{l=0}^{W'-1} \sum_{f=0}^{W/2} w'(l)\, \big(|\hat{S}_{j,n+l,f}| - |S_{j,n+l,f}|\big)^2}, \tag{13}$$

where $w'(l)$ is a Hanning window of length $W' = 12$ frames and $(\hat{S}_{jnf})$ and
$(S_{jnf})$ are the short-term Fourier transforms of the j-th estimated source and
the j-th true source respectively. The overall performance is measured by a global
SDR defined as the median of local SDRs for all sources and all time frames. 
We believe that this performance measure accounts better for subjective effects 
than the standard time-domain SDR. Indeed the ear is approximately phase- 
blind and the error perceived at a given time depends only on the power of the 
target signal at that time, not on its total energy. However the actual subjective 
performance is better assessed by listening to the estimated source signals. 
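A local phase-blind SDR of this kind, together with its median over frames and sources, can be sketched as follows. The exact frame alignment of the local window and the handling of its end points are assumptions made for the illustration, and the function names are hypothetical.

```python
import numpy as np

def local_sdr(S_est, S_true, n, W_loc=12):
    """Phase-blind local SDR in dB over W_loc frames starting at frame n.

    S_est, S_true : (N, F) short-term Fourier transforms of the estimated
                    and true source (complex values or magnitudes)
    """
    w = np.hanning(W_loc + 2)[1:-1, None]        # Hanning window, zero ends dropped
    est = np.abs(S_est[n:n + W_loc])
    true = np.abs(S_true[n:n + W_loc])
    num = np.sum(w * true ** 2)
    den = np.sum(w * (est - true) ** 2)          # only magnitudes are compared
    return 10.0 * np.log10(num / den)

def global_sdr(S_est_all, S_true_all, W_loc=12):
    """Median of the local SDRs over all sources and all full windows."""
    values = [local_sdr(e, t, n, W_loc)
              for e, t in zip(S_est_all, S_true_all)
              for n in range(t.shape[0] - W_loc + 1)]
    return float(np.median(values))

# Toy check: an estimate with 90 % of the true magnitude and a wrong phase.
rng = np.random.default_rng(0)
S = rng.standard_normal((30, 16)) + 1j * rng.standard_normal((30, 16))
S_hat = 0.9 * S * np.exp(1j * 0.7)
sdr_local = local_sdr(S_hat, S, 5)
sdr_global = global_sdr([S_hat], [S])
```

In this toy case the magnitude error is one tenth of the target magnitude everywhere, so both values equal 20 dB regardless of the phase offset, which is what phase-blind means here.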


3.2 Results 


We consider two sets of test mixtures: ten mixtures of two sources using real 
sources from the SQAM database², and ten MIDI-synthesized mixtures from
the RWC Classical Music and Music Genre Databases containing two to five 
sources. We set the number of nonnegative components of NMF to be the same 
as the number of harmonic components estimated by HCE. This allows a rather 
fair comparison of the two methods, since in a blind context the difficulty of com- 
ponent clustering would depend on the number of components. We also separate 
MIDI-synthesized mixtures by HCE using knowledge of the note activity states. 
All the mixture signals and some of the estimated source signals are available 
for listening on http://www.elec.qmul.ac.uk/people/emmanuelv/ICA06/.

Table 1 shows that the global SDR achieved by HCE is on average 3 dB higher 
than NMF on mixtures of real sources and 6 dB higher on MIDI-synthesized mix- 
tures. Informal listening tests suggest that the estimation errors made by the two 
methods are very different. As expected, NMF often fails to separate synchro- 
nized notes in MIDI-synthesized mixtures because these notes have the same 
temporal evolution. This results in strong interference or in continuous artifacts. 
More surprisingly, NMF also produces artifacts on mixtures of real sources which 
are not synchronized. By contrast, HCE generally produces fewer artifacts, but 
some interference appears locally due to simultaneous or successive notes with 
the same frequency being fused into a single component, or to harmonic partials 
from different sources being transcribed as part of the same component. 


² http://www.ebu.ch/en/technical/publications/tech3000_series/tech3253/


Table 1. Comparison of the separation performance achieved by HCE and NMF.

Separation method    | Global SDR on various mixtures of real sources (dB)
HCE                  | 12.9  13.3  13.8  10.9  19.3  17.3  14.6  15.2  14.7  11.9
NMF                  | 10.3  11.6  12.6   7.2  14.0  11.8  10.9  11.9  13.0  10.4

Separation method    | Global SDR on various MIDI-synthesized mixtures (dB)
HCE with true score  | 14.5  29.3  10.8  13.0  10.4   3.0  12.3   9.4  17.7   8.8
HCE                  | 15.4  29.3   7.2  11.9  10.7   5.5  11.5   3.2  17.0   5.1
NMF                  |  6.9   7.4   5.7   7.0   1.4   3.5   3.4   2.6  14.5   3.3
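The average gains quoted in the text can be recomputed from Table 1. The short check below reads the NMF row for MIDI-synthesized mixtures with one decimal place, like every other row of the table (for example 7.4 dB rather than 74 dB).

```python
from statistics import mean

# Global SDR values (dB) from Table 1.
real = {
    "HCE": [12.9, 13.3, 13.8, 10.9, 19.3, 17.3, 14.6, 15.2, 14.7, 11.9],
    "NMF": [10.3, 11.6, 12.6, 7.2, 14.0, 11.8, 10.9, 11.9, 13.0, 10.4],
}
midi = {
    "HCE": [15.4, 29.3, 7.2, 11.9, 10.7, 5.5, 11.5, 3.2, 17.0, 5.1],
    "NMF": [6.9, 7.4, 5.7, 7.0, 1.4, 3.5, 3.4, 2.6, 14.5, 3.3],
}

gain_real = mean(real["HCE"]) - mean(real["NMF"])   # about 3.0 dB
gain_midi = mean(midi["HCE"]) - mean(midi["NMF"])   # about 6.1 dB
```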














The knowledge of the note activity states does not substantially improve
the performance of HCE for seven out of ten MIDI-synthesized mixtures³. It
is interesting to note that the number of notes estimated by HCE on MIDI- 
synthesized mixtures is on average 2.5 times larger than the actual number of 
notes being played. Most of the spurious notes have short duration and are due to 
the system trying to represent non-harmonic parts of the signal using harmonic 
components, which does not seem to affect the separation performance. 

Other experiments suggested that the performance of NMF decreases when 
more components are allowed and does not change significantly when initializing 
the NMF basis spectra by the spectra of the harmonic components estimated by 
HCE. Thus the limited performance of NMF on the test mixtures seems to be 
the effect of the model itself rather than algorithmic issues. 


4 Conclusion 


In this paper, we address the blind source separation problem for single-channel 
musical mixtures where the notes are near-periodic signals containing harmonic 
sinusoidal partials. The proposed method, which exploits harmonicity and other 
generic source priors, performs better than NMF on various test mixtures. This 
suggests that the NMF model is not sufficiently constrained to ensure that typical 
audio source properties hold for the separated sources and that more precise 
generic source models can help separation without needing specific information 
about a particular mixture. 

The main limitation of HCE is that it cannot deal with mixtures containing 
voice or drum instruments. This limitation could be addressed using a three- 
component generative model including probabilistic models for wideband noise 
components and transient components, in the spirit of the CASA system pro- 
posed in [12]. The proposed model could also be improved by adding slightly 
inharmonic components to represent instruments such as piano or guitar or by 
performing automatic adaptation of the probabilistic priors to the mixture to 
increase their precision and help reduce separation errors. 


³ For some mixtures the estimated note activity states lead to a better SDR than
the true states because the perceptual weights used for decomposition are not taken
into account for evaluation. In practice, the subjective performance of HCE using
the true note activity states is always better than or equal to that of blind HCE.